Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(api): restrict arbitrary input nesting #8917

Merged
merged 5 commits into from
Apr 30, 2024

Conversation

kszucs
Copy link
Member

@kszucs kszucs commented Apr 9, 2024

Previously we allowed arbitrary nesting of input expressions for various
API methods like table.join, table.group_by, etc. While this was
backward compatible with 8.0.0 it also exposes additional ambiguity
which can be confusing for users and difficult to reason about.

This change restricts the nesting of input expressions to a
"single level" of nesting with the exception of the first positional
argument which can be a list of expressions, but only if there are no
more positional arguments. Examples:

t.select([t.foo, t.bar])  # OK
t.select(t.foo, t.bar)    # OK
t.select([t.foo], t.bar)  # Error
t.select(t.foo, name=t.bar)  # OK
t.select([t.foo], name=t.bar)  # OK

Aims to resolve #7819 and #8304, depends on #8884

@@ -293,6 +281,19 @@ def _bind_reduction_filter(self, where):

return where.resolve(self)

def bind(self, *args, **kwargs):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs docstrings and must be covered with tests, but first we should have a consensus on the API behaviour.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests are there but the docstring is missing.

@kszucs
Copy link
Member Author

kszucs commented Apr 9, 2024

There are some ambiguities around the first arguments which can be a list of expressions:

table.select(["a", ["b"]])

# constructs

r0 := UnboundTable: table
  a int8
  b int16
  c int32
  d int64
  e float32
  f float64
  g string
  h boolean
  i timestamp
  j date
  k time

Project[r0]
  a:      r0.a
  ('b',): ['b']

a is interepreted as a column whereas ["b"] interpreted as an array literal.

@kszucs
Copy link
Member Author

kszucs commented Apr 9, 2024

Apparently we also have a backend test case using a mapping as a single positional argument.

@cpcloud
Copy link
Member

cpcloud commented Apr 12, 2024

Why can't we keep

t.select([t.foo], t.bar)

?

It doesn't seem meaningfully different from allowing

t.select([t.foo], bar=t.bar)

@kszucs
Copy link
Member Author

kszucs commented Apr 12, 2024

If we allow:

t.select([t.foo], t.bar)

then should we allow

t.select([t.foo], [t.bar])

as well?

@cpcloud
Copy link
Member

cpcloud commented Apr 12, 2024

I think so. It's really things like t.select([[...]], [[...]]) that should be disallowed.

@kszucs
Copy link
Member Author

kszucs commented Apr 12, 2024

My preference would be to either provide args as variadics or as a single list-like argument. Even then we have confusion when to treat list-like inputs as literals vs value list.

@cpcloud cpcloud added this to the 9.0 milestone Apr 12, 2024
@jcrist
Copy link
Member

jcrist commented Apr 15, 2024

We talked today during triage and decided that at minimum we want to remove the support for arbitrarily nested expressions (as Phillip noted above).

I actually agree with the initial proposal at the top (only accept lists of column refs if select is a 1-arg function, otherwise treat it as variadic). Coincidentally this matches how polars select works, so there's some ecosystem-consistency there. The trick with doing this as a breaking change though is that previously working code may instead silently be treated as an array literal.

With this PR, what happens here? Does this error, or does it silently coerce these to array literals?

t.select(["a", "b"], ["c", "d"])  # is this two array literals or an error?

t.select("a", "b", ["c", "d"])   # is ["c", "d"] an array literal, or does this error?

Either way, I think we definitely want to forbid the case that phillip brought up before 9.0. I'm not opposed to the full changes here either, I think they make things more consistent, but we can always do that later IMO in 10.0.

@kszucs
Copy link
Member Author

kszucs commented Apr 17, 2024

t.select(["a", "b"], ["c", "d"])  # is this two array literals or an error?

yes, two array literals, not erroring

t.select("a", "b", ["c", "d"])   # is ["c", "d"] an array literal, or does this error?

yes, an array literal, not erroring

@kszucs
Copy link
Member Author

kszucs commented Apr 17, 2024

I would prefer deciding what should be the ideal long term behaviour then extend that to be backward compatible and either raise a deprecation warning or provide a config option to keep the previous behaviour.

@cpcloud
Copy link
Member

cpcloud commented Apr 17, 2024

Let's keep the full changeset here, and make sure to have both a BREAKING CHANGE bit as well as documentation on the behavior in the relevant APIs (select and mutate).

@kszucs
Copy link
Member Author

kszucs commented Apr 19, 2024

Another thing I'm not sure about is whether we should do the dereferencing here or not? Like should t.bind(totally_different_table.col) raise?

@cpcloud
Copy link
Member

cpcloud commented Apr 19, 2024

Seems like it might keep bind more flexible and easier to maintain if we dereference elsewhere.

@kszucs
Copy link
Member Author

kszucs commented Apr 19, 2024

Actually it could help code consolidation. I can try it after the dereferencer improvement gets merged:

# TODO(kszucs): should use (table, *args, **kwargs) instead to avoid interpreting
# nested inputs
def bind(table: Table, value: Any, int_as_column=False) -> Iterator[ir.Value]:
def bind(table: Table, value) -> Iterator[ir.Value]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • privatize to _bind()?
  • make the API distinction that this takes a single arg, and call this _bind_one()?
  • t.bind(othertable) feels not obvious what is supposed to happen (t.bind(t) feels acceptable and we could keep it). Should we error if the two tables are different? They can always do t.bind(othertable.columns) to get the old behavior

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These don't seem like hard blockers (we try to reserve requesting changes for hard blockers).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that workflow makese sense, thanks! Indeed these requests aren't hard blockers. I think my request for test cases is a blocker for me though, see my comment down there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t.bind() also dereference the value to t now. bind() simply coerces the input to a value expression and remains a private function.

ibis/tests/expr/test_table.py Show resolved Hide resolved
@cpcloud
Copy link
Member

cpcloud commented Apr 25, 2024

We should probably kill destructure() now that proper alternatives (lift) are available.

@cpcloud
Copy link
Member

cpcloud commented Apr 25, 2024

I'm going to rip out destructure in a separate PR.

@kszucs
Copy link
Member Author

kszucs commented Apr 26, 2024

I'm going to rip out destructure in a separate PR.

Updated the test case to expect an input error rather than an integrity error. We should update unwrap_aliases() to raise ibis input error instead of integrity error too.

@kszucs kszucs force-pushed the biiind branch 2 times, most recently from cda3364 to 4b9d544 Compare April 26, 2024 08:00
ibis/tests/expr/test_table.py Outdated Show resolved Hide resolved
@kszucs kszucs force-pushed the biiind branch 3 times, most recently from 6a92a74 to 472f5c2 Compare April 26, 2024 17:18
@cpcloud cpcloud requested a review from NickCrews April 26, 2024 18:14
ibis/expr/builders.py Outdated Show resolved Hide resolved
@cpcloud cpcloud added the refactor Issues or PRs related to refactoring the codebase label Apr 26, 2024
@cpcloud
Copy link
Member

cpcloud commented Apr 28, 2024

@NickCrews Can you review this again? This is a remaining blocker for the 9.0 release which we'd like to get out early next week.

Previously we allowed arbitrary nesting of input expressions for various
API methods like `table.join`, `table.group_by`, etc. While this was
backward compatible with `8.0.0` it also exposes additional ambiguity
which can be confusing for users and difficult to reason about.

This change restricts the nesting of input expressions to a
"single level" of nesting with the exception of the first positional
argument which can be a list of expressions, but only if there are no
more positional arguments. Examples:

```python
t.select([t.foo, t.bar])  # OK
t.select(t.foo, t.bar)    # OK
t.select([t.foo], t.bar)  # Error
t.select(t.foo, name=t.bar)  # OK
t.select([t.foo], name=t.bar)  # OK
```
@NickCrews
Copy link
Contributor

I just took a long look.

Most things look great, a few nits, but the one substantial thing is the t.bind([1,"a"]) behavior. I have an inline comment above. Can someone fill me in on the context there a little more? I don't think this is a blocker for me.

@cpcloud
Copy link
Member

cpcloud commented Apr 29, 2024

@NickCrews Thanks for doing a deep dive on tests. There were indeed a few cases that weren't tested and also the lambda _: 2 case wasn't yet implemented. Appreciate the feedback!

@NickCrews
Copy link
Contributor

Cool this looks great then!

@cpcloud
Copy link
Member

cpcloud commented Apr 30, 2024

@NickCrews Can you approve the PR?

@NickCrews
Copy link
Contributor

Sorry I was on the mobile app where the approval status isn't even shown so i didn't know there was anything to do

@cpcloud cpcloud merged commit fd35b66 into ibis-project:main Apr 30, 2024
82 checks passed
@cpcloud cpcloud deleted the biiind branch April 30, 2024 15:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
refactor Issues or PRs related to refactoring the codebase
Projects
None yet
4 participants